Formally, Earth Mover's Distance (EMD) can be stated in terms of a linear programming problem: two distributions represented by signatures, \(P=\left\{\left(p_{1}, w_{p}\right\}, \dots,\left(p_{m}, w_{p m}\right)\right\}\) and \(Q=\left\{\left(q_{1}, w_{q}\right\}, \ldots,\left(q_{n}, w_{q n}\right)\right\}\) where \(p_i\),\(q_i\) are bin centroids with frequencies \(w_{pi},w_{qi}\), and \(D = [d_{ij}]\) the matrix containing the Euclidean distances between \(p_i\) and \(q_j\) for all \(i,j\). We ensure that \(P\) and \(Q\) have the same total mass of unity (equal to 1) by normalizing each of the two distributions. Next, we find a flow \(F = [f_{ij}]\) between \(p_i\) and \(q_j\) that minimizes the total cost:
\[\operatorname{cost}(P, Q, F)=\sum_{i=1}^{m} \sum_{j=1}^{n} d_{i j} f_{i j}\] (1)
subject to the following constraints:
\[f_{i j} \geq 0 \quad 1 \leq i \leq m, 1 \leq j \leq n\] (2)
\[\sum_{j=1}^{n} f_{i j}=w_{p_{n}} \quad 1 \leq i \leq m\] (3)
\[\sum_{i=1}^{m} f_{i j}=w_{q_{j}} \quad 1 \leq j \leq n\] (4)
\[\sum_{i=1}^{m} \sum_{j=1}^{n} f_{i j}=\sum_{i=1}^{m} w_{p_{i}}=\sum_{j=1}^{n} w_{q_{j}}=1\] (5)
Constraint (2) ensures that mass is only transported in one direction (e.g., from the source sample to the destination sample). Constraints (3) and (4) limit the amount of mass that can be moved from/to a given signature bin to their respective weights; and, constraint (5) ensures that the amount of mass moved does not exceed the maximum possible amount.
In the case of signatures with the same total mass, EMD is a true metric for distributions and is equivalent to the Mallow’s distance [21] as demonstrated by Levina and Bickel [22]. Thus, when applied to probability distributions, EMD has a clear probabilistic interpretation as the Mallow’s distance (in the applications described here, we ensure equal mass of two samples but retain the EMD notation.).
Solving the above linear programming problem determines the optimal flow, \(F\), between the source and destination signatures subject to constraints (2–5). Then EMD is defined as a function of the optimal flow \(F = [f_ij]\) and the ground distance \(D = [d_{ij}]\) [22]: \[E M D(P, Q)=\frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{i j} f_{i j}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{i j}}\]
(Taken directly from Orlova et al. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0151859)
Figure: Schematic of EMD (https://sbl.inria.fr/doc/group__Earth__mover__distance-package.html).
First, we load Darwin data (from https://github.com/brorfred/ocean_clustering/) at the second coarsest resolution (4 degrees).
## Read in data
datadir = "/home/shyun/Dropbox/research/usc/ocean-provinces/omd/data"
filenames = c("tabulated_darwin_montly_clim_045_090_ver_0_2.csv",
"tabulated_darwin_montly_clim_090_180_ver_0_2.csv",
"tabulated_darwin_montly_clim_180_360_ver_0_2.csv",
"tabulated_darwin_montly_clim_360_720_ver_0_2.csv",
"tabulated_geospatial_montly_clim_045_090_ver_0_2.csv",
"tabulated_geospatial_montly_clim_090_180_ver_0_2.csv",
"tabulated_geospatial_montly_clim_180_360_ver_0_2.csv",
"tabulated_geospatial_montly_clim_360_720_ver_0_2.csv")
dat = read.csv(file.path(datadir, filenames[1]))
Then, we visualize the entire chlorophyll data.
## Visualize data
mydat = dat[which(dat[,"month"]==1),]
colfun = colorRampPalette(c("blue", "red"))
## colfun = colorRampPalette(c(rgb(0,0,1,0.1), rgb(1,0,0,0.1)))
newmap <- getMap()
map("world", fill=TRUE, col="white", bg="white", ylim=c(-90, 90), mar=c(0,0,0,0))
par(mar=c(0,0,0,0))
## plot(newmap)##, xlim = c(-20, 59), ylim = c(35, 71), asp = 1)
centers = sample(mydat$Chl, 5)
points(mydat[ ,c("lon", "lat")], pch=15, cex=4,
col=colfun(20)[as.numeric(cut(mydat$Chl,breaks = 20))])
map("world", fill=TRUE, col="white", bg="white", ylim=c(-90, 90), mar=c(0,0,0,0), add=TRUE)
lat = 19.8968
lon = -155.5828
boxsize = 30
lonrange = lon + c(-1,1)*boxsize
latrange = lat + c(-1,1)*boxsize
rect(lonrange[1], latrange[1], lonrange[2], latrange[2],
border = c("white"), lwd=3)
## par(mar=c(0,0,0,0))
## plot(newmap)##, xlim = c(-20, 59), ylim = c(35, 71), asp = 1)
## centers = sample(mydat$Chl, 5)
## points(mydat[ ,c("lon", "lat")], pch=15, cex=2,
## col = kmeans(mydat$Chl, centers)$cluster)
Now, zoom into the smaller region in the box (near Hawaii).
dat = read.csv(file.path(datadir, filenames[2]))
## Visualize data
mydat = dat[which(dat[,"month"]==1),]
lat = 19.8968
lon = -155.5828
boxsize = 30
lonrange = lon + c(-1,1)*boxsize
latrange = lat + c(-1,1)*boxsize
## plot(newmap,
## xlim = lonrange,
## ylim = latrange,
## asp = 1,
## lwd = 3)
map("world", fill=TRUE, col="white", bg="white",
ylim=latrange, xlim=lonrange, mar=c(0,0,0,0))
points(mydat[ ,c("lon", "lat")], pch=15, cex=8.75,
col=colfun(20)[as.numeric(cut(mydat$Chl,breaks = 20))])
map("world", fill=TRUE, col="white", bg="white",
ylim=latrange, xlim=lonrange, mar=c(0,0,0,0), add=TRUE)
First, visualize the January and February chlorophyll data (Darwin).
Values are all normalized to sum to 1.
## Jan
mydat = dat[which(dat[,"month"]==1),]
jan.dat = mydat[which(lonrange[1] < mydat[,"lon"] & mydat[, "lon"] < lonrange[2] &
latrange[1] < mydat[,"lat"] & mydat[, "lat"] < latrange[2]),]
jan.mat = make_mat(jan.dat)
jan.mat = jan.mat/sum(jan.mat)
drawmat_precise(jan.mat, main="Jan, Darwin")
## Feb
mydat = dat[which(dat[,"month"]==2),]
feb.dat = mydat[which(lonrange[1] < mydat[,"lon"] & mydat[, "lon"] < lonrange[2] &
latrange[1] < mydat[,"lat"] & mydat[, "lat"] < latrange[2]),]
feb.mat = make_mat(feb.dat)
feb.mat = feb.mat/sum(feb.mat)
drawmat_precise(feb.mat, main="Feb, Darwin")
Now, we calculate optimal transport and make a "vector map" of mass movement.
## Get optimal transport
res <- transport(pgrid(jan.mat), pgrid(feb.mat), p=2,
method="aha")
## Plot both
plot(pgrid(jan.mat), pgrid(feb.mat), res,
## Style options
mass = "thickness", acol = rgb(0,0,1,0.5),
rot = TRUE)
title(main="Jan to Feb, Darwin")
Now, we highlight these arrows for four ranges (0-25%, 25%-50%, 50%-75%, 75%-100%) of the magnitude of the mass transfer.
## Separate them by magnitude of the mass transfer
nbin = 4
## cutoffs = seq(from = min(res$mass), to = max(res$mass), length = nbin+1)
cutoffs = quantile(res$mass)
par(mfrow=c(2,2))
for(ii in 1:nbin){
massrange = cutoffs[ii:(ii+1)]
## cols = rep(rgb(0,0,1,0.2), length(res$mass))
cols = rep(NA, length(res$mass))
cols[which(massrange[1] < res$mass & res$mass < massrange[2])] = rgb(1,0,0, 0.7)
plot(pgrid(jan.mat), pgrid(feb.mat), res,
## Style options
mass="thickness", acol=cols,
rot = TRUE)
title(main = paste("Magnitude", c("Lowest", "Second to lowest",
"Second to highest", "Highest")[ii]),
cex.main = 3)
}
From the optimal transport, we directly know the scalar earthmovers distance, by the formula:
\[E M D(P, Q)=\frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{i j} f_{i j}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{i j}}\]
all.mats = list()
all.pmats = list()
for(ii in 1:12){
## One month's data
mydat = dat[which(dat[,"month"]==ii),]
one.dat = mydat[which(lonrange[1] < mydat[,"lon"] & mydat[, "lon"] < lonrange[2] &
latrange[1] < mydat[,"lat"] & mydat[, "lat"] < latrange[2]),]
one.mat = make_mat(one.dat)
one.mat = one.mat/sum(one.mat)
all.mats[[ii]] = one.mat
all.pmats[[ii]] = pgrid(one.mat)
}
Now, we obtain all twelve months' Darwin data and visualize optimal transports in adjacent months (Jan to Feb, Feb to March, and so on). Note, I highlighed the top 10% high-mass transfers in red.
par(mfrow=c(6,2))
for(ii in 1:11){
pgrid1 = pgrid(all.mats[[ii]])
pgrid2 = pgrid(all.mats[[ii+1]])
res <- transport(pgrid1, pgrid2, p=2,
method="aha")
## nbin = 2
## cutoffs = seq(from = min(res$mass), to = max(res$mass), length = nbin+1)
cutoffs = quantile(res$mass, probs = c(0.9,1))
jj = 1
massrange = cutoffs[jj:(jj+1)]
cols = rep(rgb(0,0,1,0.2), length(res$mass))
cols[which(massrange[1] < res$mass & res$mass < massrange[2])] = rgb(1,0,0,0.7)
plot(pgrid1, pgrid2, res,
## Style options
mass="thickness", acol=cols,
## lwd=0.1,
rot=TRUE, length=0.2)
title(main = paste0( month.abb[ii], " to ", month.abb[ii+1]))
}